Abstract

The paper defends the notion that seman- 
tic tagging should be viewed as more than 
disambiguation between senses. Instead, 
semantic tagging should be a first step 
in the interpretation process by assigning 
each lexical item a representation of all 
of its systematically related senses, from 
which further semantic processing steps 
can derive discourse dependent interpre- 
tations. This leads to a new type of se- 
mantic lexicon (CoreLex) that supports 
underspecified semantic tagging through 
a design based on systematic polysemous 
classes and a class-based acquisition of lex- 
ical knowledge for specific domains. 



1 Underspecified semantic tagging 

Semantic tagging has mostly been considered as 
nothing more than disambiguation to be performed 
along the same lines as part-of-speech tagging: given 
n lexical items each with m senses apply linguis- 
tic heuristics and/or statistical measures to pick 
the most likely sense for each lexical item (see e.g. 
(Yarowsky, 1992), (Stevenson and Wilks, 1997)). 

I do not believe this to be the right approach because 
it blurs the distinction between 'related' (systematic 
polysemy) and 'unrelated' senses (homonymy: bank 
- bank). Although homonyms need to be tagged with 
a disambiguated sense, this is not necessarily so in 
the case of systematic polysemy. There are two reasons 
for this that I will discuss briefly here. 

First, the problem of multiple reference. Consider 
this example from the BROWN corpus: 

[A long book heavily weighted with 
military technicalities]NP, in this edition 
it is neither so long nor so technical as 
it was originally. 



The discourse marker (it) refers back to an NP that 
expresses more than one interpretation at the same 
time. The head of the NP (book) has a number 
of systematically related senses that are being ex- 
pressed simultaneously. The meaning of book in this 
sentence cannot be disambiguated between the num- 
ber of interpretations that are implied: the informa- 
tional content of the book (military technicali- 
ties), its physical appearance (heavily weighted) 
and the events that are involved in its construction 
and use (long). 

The example illustrates the fact that disambiguation 
between related senses is not always possible, 
which leads to the further question whether a discrete 
distinction between such senses is desirable at all. A 
number of researchers have answered this question 
negatively (see e.g. (Pustejovsky, 1995), (Kilgarriff, 
1992)). Consider these examples from BROWN: 



(1) fast run-up (of the stock) 
(2) fast action (by the city government) 
(3) fast footwork (by Washington) 
(4) fast weight gaining 
(5) fast condition (of the track) 
(6) fast response time 
(7) fast people 
(8) fast ball 



Each use of the adjective 'fast' in these examples has 
a slightly different interpretation that could be cap- 
tured in a number of senses, reflecting the different 
syntactic and semantic patterns. For instance: 

1. 'a fast action' (1, 2, 3, 4) 

2. 'a fast state of affairs' (5, 6) 

3. 'a fast object' (7, 8) 

On the other hand all of the interpretations have 
something in common also, namely the idea of 
'speed'. It seems therefore useful to underspecify 
the lexical meaning of 'fast' to a representation that 
captures this primary semantic aspect and gives a 
general structure for its combination with other lex- 
ical items, both locally (in compositional semantics) 
and globally (in discourse structure). 



Both the multiple reference and the sense enumeration 
problem show that lexical items mostly have 
an indefinite number of related but highly discourse 
dependent interpretations, which cannot be 
distinguished by semantic tagging alone. Instead, 
semantic tagging should be a first step in the 
interpretation process by assigning each lexical item a 
representation of all of its systematically related 'senses'. 
Further semantic processing steps derive discourse 
dependent interpretations from this representation. 
Semantic tags are therefore more like pointers to 
complex knowledge representations, which can be 
seen as underspecified lexical meanings. 

2 CoreLex: a Semantic Lexicon 
with Systematic Polysemous 
Classes 

In this section I describe the structure and content 
of a lexicon (CoreLex) that builds on the assump- 
tions about lexical semantics and discourse outlined 
above. More specifically, it is to be 'structured in 
such a way that it reflects the lexical semantics 
of a language in systematic and predictable ways' 
(Pustejovsky, Boguraev, and Johnston, 1995). This 
assumption is fundamentally different from the design 
philosophies behind existing lexical semantic 
resources like WordNet that do not account for any 
regularities between senses. For instance, WordNet 
assigns to the noun book the following senses: 



publication 
product, production 
fact 

dramatic_composition, dramatic_work 
record 

section, subdivision 
journal 



Figure 1: WordNet senses for the noun book 

At the top of the WordNet hierarchy these seven 
senses can be reduced to two unrelated 'basic senses': 
the content that is being communicated (commu- 
nication) and the medium of communication (ar- 
tifact). More accurately, book should be assigned a 
qualia structure which implies both of these interpre- 
tations and connects them to each of the more spe- 
cific senses that WordNet assigns: that is, facts, 
drama and a journal can be part-of the content of a 
book; a section is part-of both the content and the 
medium; publication, production and record- 
ing are all events in which both the content and the 
medium aspects of a book can be involved. 

An important advantage of the CoreLex approach 
is more consistency among the assignments of lexical 
semantic structure. Consider the senses that 
WordNet assigns to door, gate and window: 



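[Figure content garbled in extraction. Sense labels that 
survive: movable_barrier, artifact, entrance, opening, 
access, cognition, knowledge, house, room, 
computer_circuit, gross_income, panel, display.] 

Figure 2: WordNet senses for the nouns door, 
window and gate 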

Obviously these are similar words, something which 
is not expressed in the WordNet sense assignments. 
In the CoreLex approach, these nouns are 
given the same semantic type, which is underspecified 
for any specific 'sense' but consistently assigns 
them the same basic lexical semantic structure 
that expresses the regularities between all of their 
interpretations. 

However, despite its shortcomings WordNet is a 
vast resource of lexical semantic knowledge that can 
be mined, restructured and extended, which makes 
it a good starting point for the construction of 
CoreLex. The next sections describe how system- 
atic polysemous classes and underspecified semantic 
types can be derived from WordNet. In this pa- 
per I only consider classes of nouns, but the process 
described here can also be applied to other parts of 
speech. 

2.1 Systematic polysemous classes 

We can arrive at classes of systematically poly- 
semous lexical items by investigating which items 
share the same senses and are thus polysemous in 
the same way. This comparison is done at the top 
levels of the WordNet hierarchy. WordNet does 
not have an explicit level structure, but for the purpose 
of this research one can distinguish a set of 32 
'basic senses' that partly coincides with, but is not 
based directly on, WordNet's list of 26 'top types': 

act (act), agent (agt), animal (anm), 
artifact (art), attribute (atr), blunder 
(bln), cell (cel), chemical (chm), 
communication (com), 
event (evt), food (fod), form (frm), 
group_biological (grb), group (grp), 
group_social (grs), human (hum), 
linear_measure (lme), location (loc), 
location_geographical (log), measure 
(mea), natural_object (nat), phenomenon 
(phm), plant (plt), possession 
(pos), part (prt), psychological 
(psy), quantity_definite (qud), 
quantity_indefinite (qui), relation (rel), 
space (spc), state (sta), time (tme) 

Figure 3 shows their distribution among noun stems 
in the BROWN corpus. For instance there are 2550 
different noun stems (with 49,824 instances) that 
each have 2 out of the 32 'basic senses' assigned to 
them in 238 different combinations (a subset of 32^2 
= 1024 possible combinations). 



senses   comb's   stems   instances 
2        238      2550    49824 
3        379      936     35608 
4        268      347     22543 
5        148      154     15345 
6        52       52      5915 
7        27       27      5073 
8        10       10      3273 
9        3        3       1450 
10       1        1       483 
11       2        2       959 
12       1        1       441 
         1161     10797   140914 



Figure 3: Polysemy of nouns in BROWN 



We now reduce all of WordNet's sense assignments 
to these basic senses. For instance, the seven different 
senses that WordNet assigns to the lexical item 
book (see Figure 1 above) can be reduced to the two 
basic senses: 'art com'. We do this for each lexical 
item and then group them into classes according to 
their assignments. 

From these one can filter out those classes that have 
only one member because they obviously do not 
represent a systematically polysemous class. The lexical 
items in those classes have a highly idiosyncratic 
behavior and are most likely homonyms. This leaves 
a set of 442 polysemous classes, of which Figure 4 
gives a selection: 



Figure 4: A selection of polysemous classes 
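
The reduction and grouping described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build CoreLex: the input dictionary is a hypothetical toy sample, and the function name `polysemous_classes` is an assumption introduced here.

```python
from collections import defaultdict

# Hypothetical toy input: each noun mapped to the basic senses that its
# WordNet senses reduce to (duplicates are possible before reduction).
BASIC_SENSES = {
    "book": ["art", "com", "com", "art"],
    "journal": ["com", "art"],
    "lamb": ["anm", "fod"],
    "quail": ["fod", "anm"],
    "stud": ["act", "anm", "art"],
}


def polysemous_classes(basic_senses):
    """Group nouns by their set of basic senses and drop one-member classes."""
    classes = defaultdict(list)
    for noun, senses in basic_senses.items():
        # The class label is the sorted set of basic senses, e.g. 'art com'.
        label = " ".join(sorted(set(senses)))
        classes[label].append(noun)
    # Classes with a single member are filtered out, as in the text.
    return {label: sorted(nouns) for label, nouns in classes.items() if len(nouns) > 1}


classes = polysemous_classes(BASIC_SENSES)
```

In this toy sample 'stud' ends up alone in its class and is dropped, which mirrors the filtering step but says nothing about whether a surviving class is truly systematic; that still has to be judged by hand.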



Not all of the 442 classes are systematically polyse- 
mous. Consider for example the following classes: 



act anm art    drill ruff solitaire stud 
act log        bolivia caliphate charleston chicago clearing 
               emirate michigan prefecture repair santiago 
               wheeling 
act plt        chess grapevine rape 
art fod loc    pike port 
chm psy        complex incense 
fod hum plt    mandarin sage swede 



Figure 5: A selection of ambiguous classes 



Some of these classes are collections of homonyms 
that are ambiguous in similar ways, but do not lead 
to any kind of predictable polysemous behavior, for 
instance the class 'act anm art' with the lexical 
items: drill, ruff, solitaire, stud. Other classes 
consist of both homonyms and systematically polysemous 
lexical items, like the class 'act log', which 
includes caliphate, clearing, emirate, prefecture, repair, 
wheeling vs. bolivia, charleston, chicago, michigan. 
Whereas the first group of nouns express two separate 
but related meanings (the act of clearing, repair, 
etc. takes place at a certain location), the 
second group expresses two meanings that are not 
related (the Charleston dance, which was named after 
the town by the same name). 

The ambiguous classes need to be removed alto- 
gether, while the ones with mixed ambiguous and 
polysemous lexical items are to be weeded out care- 
fully. 



Figure 4 (contents): a selection of polysemous classes 

act art evt rel    click modification reverse 
act art log        berth habitation mooring 
act evt nat        ascent climb 
chm sta            grease ptomaine 
com prt            appendix brickbat index 
frm sta            solid vacancy void 
lme qud            em fathom fthm inch mil 
loc psy            bourn bourne demarcation 
                   fairyland rubicon trend vertex 
log pos sta        barony province 
phm pos            accretion usance wastage 
rel sta            baronetcy connectedness 
                   context efficiency inclusion 
                   liquid relationship 



2.2 Underspecified semantic types 

The next step in the research is to organize the remaining 
classes into knowledge representations that 
relate their senses to each other. These representations 
are based on Generative Lexicon theory (GL), 
using qualia roles and (dotted) types (Pustejovsky, 
1995). 

Qualia roles distinguish different semantic aspects: 
FORMAL indicates semantic type; CONSTITUTIVE indicates 
part-whole information; AGENTIVE and TELIC indicate 
associated events (the first dealing with the origin of 
the object, the second with its purpose). Each role 
is typed to a specific class of lexical items. Types 
are either simple (human, artifact, ...) or complex 
(e.g., information•physical). Complex types are 
called dotted types after the 'dots' that are used as 
type constructors. Here I introduce two kinds of 
dots: 

Closed dots '•' connect systematically re- 
lated types that are always interpreted si- 
multaneously. 

Open dots '∘' connect systematically related 
types that are not (normally) interpreted 
simultaneously. 

Both 'σ•τ' and 'σ∘τ' denote sets of pairs of objects 
(a, b), a an object of type σ and b an object of type 
τ. A condition aRb restricts this set of pairs to only 
those for which some relation R holds, where R denotes 
a subset of the Cartesian product of the sets 
of type σ objects and type τ objects. 

The difference between types 'σ•τ' and 'σ∘τ' is in 
the nature of the objects they denote. The type 
'σ•τ' denotes sets of pairs of objects where each 
pair behaves as a complex object in discourse structure. 
For instance, the pairs of objects that are introduced 
by the type information•physical (book, 
journal, scoreboard, ...) are addressed as the complex 
objects (x:information, y:physical) in discourse. 
On the other hand, the type 'σ∘τ' denotes simply 
a set of pairs of objects that do not occur together 
in discourse structure. For instance, the pairs of objects 
that are introduced by the type form∘artifact 
(door, gate, window, ...) are not (normally) addressed 
simultaneously in discourse; rather, one side 
of the object is picked out in a particular context. 
Nevertheless, the pair as a whole remains active during 
processing. 

The resulting representations can be seen as under- 
specified lexical meanings and are therefore referred 
to as underspecified semantic types. CoreLex cur- 
rently covers 104 underspecified semantic types. 
This section presents a number of examples; for a 
complete overview see the CoreLex webpage¹. 



Closed Dots Consider the underspecified representation 
for the semantic type act•relation: 



FORMAL = Q:act•relation 

CONSTITUTIVE = 
    X:act ∨ Y:relation ∨ Z:act•relation 

TELIC = 
    P:event(act•relation) ∧ act(R1) ∧ 
    relation(R2,R3) 

Figure 6: Representation for type: act•relation 



The representation introduces a number of objects 
that are of a certain type. The FORMAL role introduces 
an object Q of type act•relation. The 
CONSTITUTIVE introduces objects that are in a part-whole 
relationship with Q. These are either of the 
same type act•relation or of the simple types act or 
relation. The TELIC expresses the event P that can 
be associated with an object of type act•relation. 
For instance, the event of increase as in 'increasing the 
communication between member states' implies 
'increasing' both the act of communicating an object 
R1 and the communication relation between two 
objects R2 and R3. All these objects are introduced 
on the semantic level and correspond to a number 
of objects that will be realized in syntax. However, 
not all semantic objects will be realized in syntax. 
(See Section 3.4 for more on the syntax-semantics 
interface.) 



The instances for the type act•relation are given 
in Figure 7, covering three different systematic polysemous 
classes. We could have chosen to include 
only the instances of the 'act rel' class, but the 
nouns in the other two classes seem similar enough 
to describe all of them with the same type. 



act evt rel    blend competition flux 
               transformation 
act rel        acceleration communication 
               dealings designation discourse gait 
               glide likening negation neologism 
               neology prevention qualifying 
               sharing synchronisation 
               synchronization synchronizing 
act rel sta    coordination gradation involvement 



¹ http://www.cs.brandeis.edu/~paulb/CoreLex/corelex.html 



Figure 7: Instances for the type: act•relation 



Open Dots The type act•relation describes interpretations 
that cannot be separated from each 
other (the act and relation aspects are intimately 
connected). The following representation for type 
animal∘food describes interpretations that cannot 
occur simultaneously but are nevertheless related². It 
therefore uses a '∘' instead of a '•' as a type 
constructor: 

FORMAL = Q:animal∘food 

CONSTITUTIVE = X:animal ∨ Y:food 

TELIC = 
    P1:act(R1, animal) ∨ P2:act(animal, R2) 
    ∨ P3:act(R3, food) 

Figure 8: Representation for type: animal∘food 

The instances for this type only cover the class 'anm 
fod'. A case could be made for including also every 
instance of the class 'anm' because in principle every 
animal could be eaten. This is a question of how 
generative the lexicon should be and whether one allows 
overgeneration of semantic objects. 



anm fod    bluepoint capon clam cockle crawdad 
           crawfish crayfish duckling fowl 
           grub hen lamb langouste limpet 
           lobster monkfish mussel octopus panfish 
           partridge pheasant pigeon poultry 
           prawn pullet quail saki scallop 
           scollop shellfish shrimp snail 
           squid whelk whitebait whitefish winkle 

Figure 9: Instances for the type: animal∘food 



2.3 Homonyms 

CoreLex is designed around the idea of system- 
atic polysemous classes that exclude homonyms. 
Traditionally a lot of research in lexical semantics 
has been occupied with the problem of ambiguity 
in homonyms. Our research shows however that 
homonyms only make up a fraction of the whole of 
the lexicon of a language. Out of the 37,793 noun 
stems that were derived from WordNet, 1637 (less 
than 5%) are to be viewed as true homonyms because 
they have two or more unrelated senses. The remaining 
95% are nouns that do have (an indefinite 
number of) different interpretations, but all of these 
are somehow related and should be inferred from a 
common knowledge representation. These numbers 
suggest a stronger emphasis in research on systematic 
polysemy and less on homonyms, an approach 
that is advocated here (see also (Kilgarriff, 1992)). 

In CoreLex homonyms are simply assigned two or 
more underspecified semantic types that need to be 
disambiguated in a traditional way. There is however 
an added value here too, because each disambiguated 
type can generate any number of context 
dependent interpretations. 

² See the literature on animal grinding, for instance 
(Copestake and Briscoe, 1992). 

3 Adapting CoreLex to Domain 
Specific Corpora 

The underspecified semantic type that CoreLex as- 
signs to a noun provides a basic lexical semantic 
structure that can be seen as the class-wide back- 
bone semantic description on top of which specific 
information for each lexical item is to be defined. 

That is, doors and gates are both artifacts but they 
have different appearances. Gates are typically open 
constructions, whereas doors tend to be solid. This 
kind of information however is corpus specific and 
therefore needs to be adapted specifically to and on 
the basis of that particular corpus of texts. 

This process involves a number of consecutive steps 
that include the probabilistic classification of unknown 
lexical items: 

1. Assignment of underspecified semantic tags to 
those nouns that are in CoreLex 

2. Running class-sensitive patterns over the 
(partly) tagged corpus 

3. (a) Constructing a probabilistic classifier from 
the data obtained in step 2. 

(b) Probabilistically tagging nouns that are not in 
CoreLex according to this classifier 

4. Relating the data obtained in step 2. to one or 
more qualia roles 

Step 1. is trivial, but steps 2. through 4. form 
a complex process of constructing a corpus specific 
semantic lexicon that is to be used in additional 
processing for knowledge intensive reasoning steps 
(i.e. abduction (Hobbs et al., 1993)) that would solve 
metaphoric, metonymic and other non-literal use of 
language. 

3.1 Assignment of CoreLex Tags 

The first step in analyzing a new corpus involves 
tagging each noun that is in CoreLex with an un- 
derspecified semantic tag. This tag represents the 
following information: a definition of the type of 
the noun (formal); a definition of types of pos- 
sible nouns it can stand in a part-whole relation- 
ship with (constitutive); a definition of types of 
possible verbs it can occur with and their argument 
structures (agentive / telic). CoreLex is imple- 
mented as a database of associative arrays, which 
allows a fast lookup of this information in pattern 
matching. 
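
A minimal sketch of such a lookup, assuming a Python dictionary as the associative array. The table `CORELEX` and the function `tag_nouns` are hypothetical names introduced here; the three type assignments are the ones mentioned as examples in this paper, and the fuller entry (constitutive, agentive and telic information) is left out for brevity.

```python
# Hypothetical fragment of the lexicon: noun stem -> underspecified semantic type.
# A full entry would also carry constitutive and agentive/telic information.
CORELEX = {
    "paragraph": "information",
    "journal": "information•physical",
    "evidence": "communication•psychological",
}


def tag_nouns(tagged_stems):
    """Attach a CoreLex tag to every noun stem that is found in the lexicon.

    tagged_stems: list of (stem, pos) pairs, with nouns marked as 'Noun'.
    Returns (stem, pos, tag) triples; tag is None for non-nouns and for
    nouns that are not in the lexicon.
    """
    return [
        (stem, pos, CORELEX.get(stem) if pos == "Noun" else None)
        for stem, pos in tagged_stems
    ]


tagged = tag_nouns([("journal", "Noun"), ("publish", "Verb"), ("platelet", "Noun")])
```

Nouns that come back with `None` are the 'unknown' nouns that the later classification step has to deal with.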



3.2 Class-Sensitive Pattern Matching 

The pattern matcher runs over corpora that are: 
part-of-speech tagged using a widely used tagger 
(Brill, 1992); stemmed by using an experimental sys- 
tem that extends the Porter stemmer, a stemming 
algorithm widely used in information retrieval, with 
the Celex database on English morphology; (partly) 
semantically tagged using the CoreLex set of un- 
derspecified semantic tags as discussed in the previ- 
ous section. 

There are about 30 different patterns that are ar- 
ranged around the headnoun of an NP. They cover 
the following syntactic constructions that roughly 
correspond to a VP, an S, an NP and an NP fol- 
lowed by a PP: 

• verb-headnoun 

• headnoun-verb 

• adjective-headnoun 

• modifiernoun-headnoun 

• headnoun-preposition-headnoun 

The patterns assume NP's of the following generic 
structure³: 

PreDet* Det* Num* (Adj | Name | Noun)* Noun 

The heuristics for finding the headnoun is then sim- 
ply to take the rightmost noun in the NP, which for 
English is mostly correct. 

The verb-headnoun patterns approach that of a 
true 'verb-obj' analysis by including a normalization 
of passive constructions as follows: 

[Noun Have? Be Adv? Verb] ⇒ [Verb Noun] 

Similarly, the headnoun-verb patterns approach 
a true 'subj-verb' analysis. However, because no 
deep syntactic analysis is performed, the patterns 
can only approximate subjects and objects in this 
way and I therefore do not refer to these patterns as 
'subject-verb' and 'verb-object' respectively. 
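
The generic NP structure, the rightmost-noun heuristic and the passive normalization can be sketched with regular expressions over a string of part-of-speech codes. This is only an illustration under stated assumptions: the one-letter codes, the coarse tag names and the function names are hypothetical and do not reflect the tag set of the tagger used here.

```python
import re

# Hypothetical one-letter codes for coarse part-of-speech tags.
TAG_CODES = {"PreDet": "P", "Det": "D", "Num": "M", "Adj": "A", "Name": "E",
             "Noun": "N", "Have": "H", "Be": "B", "Adv": "R", "Verb": "V"}

# PreDet* Det* Num* (Adj | Name | Noun)* Noun
NP = re.compile(r"P*D*M*[AEN]*N")
# [Noun Have? Be Adv? Verb], with the NP captured as group 1
PASSIVE = re.compile(r"(P*D*M*[AEN]*N)H?BR?V")


def encode(tagged):
    """Turn [(stem, tag), ...] into a string with one code per token."""
    return "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)


def head_nouns(tagged):
    """Return the rightmost noun of every NP that matches the generic pattern."""
    return [tagged[m.end() - 1][0] for m in NP.finditer(encode(tagged))]


def passive_pairs(tagged):
    """Normalize passives to (verb, headnoun) pairs, as in [Verb Noun]."""
    return [(tagged[m.end() - 1][0], tagged[m.end(1) - 1][0])
            for m in PASSIVE.finditer(encode(tagged))]


sentence = [("the", "Det"), ("long", "Adj"), ("book", "Noun"),
            ("was", "Be"), ("write", "Verb")]
heads = head_nouns(sentence)
pairs = passive_pairs(sentence)
```

Because each token maps to exactly one character, match offsets in the code string are also token indices, which keeps the head-noun lookup trivial.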

The pattern matching is class-sensitive in employing 
the assigned CoreLex tag to determine if the appli- 
cation of this pattern is appropriate. For instance, 
one of the headnoun-preposition-headnoun pat- 
terns is the following, that is used to detect part- 
whole (constitutive) relations: 

PreDet* Det* Num* (Adj | Name | Noun)* Noun of 
PreDet* Det* Num* (Adj | Name | Noun)* Noun 

³ The interpretation of '*' and '?' in this section follows 
that of common usage in regular expressions: '*' 
indicates 0 or more occurrences; '?' indicates 0 or 1 
occurrence. 

Clearly not every syntactic construction that fits this 
pattern is to be interpreted as the expression of a 
part-whole relation. One of the heuristics we therefore 
use is that the pattern may only apply if both 
head nouns carry the same CoreLex tag or if the 
tag of the second head noun subsumes the tag of the 
first one through a dotted type. That is, if the second 
head noun is of a dotted type and the first is of 
one of its composing types. For instance, 'paragraph' 
and 'journal' can be in a part-whole relation to each 
other because the first is of type information, while 
the second is of type information•physical. Similar 
heuristics can be identified for the application of 
other patterns. 
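
The tag heuristic for the part-whole pattern can be written down directly. The sketch below assumes that tags are plain strings in which '•' and '∘' separate the composing types; the function names are hypothetical.

```python
def composing_types(tag):
    """Split a (possibly dotted) type into its simple types."""
    return tag.replace("∘", "•").split("•")


def may_be_part_of(first_tag, second_tag):
    """Check whether 'first of second' may be read as a part-whole relation.

    The pattern applies if both head nouns carry the same tag, or if the
    second tag is a dotted type that has the first tag among its components.
    """
    if first_tag == second_tag:
        return True
    parts = composing_types(second_tag)
    return len(parts) > 1 and first_tag in parts


# 'paragraph' (information) of a 'journal' (information•physical)
paragraph_of_journal = may_be_part_of("information", "information•physical")
```

The check is deliberately asymmetric: a dotted first noun is not licensed by a simple second noun, since only the second tag may subsume the first.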

Recall of the patterns (percentage of nouns that 
are covered) is on average about 70% to 80% among 
different corpora: WSJ, BROWN, PDGF (a corpus we 
constructed for independent purposes from 1000 medical 
abstracts in the MEDLINE database on Platelet Derived 
Growth Factor) and DARWIN (the complete Origin 
of Species). Precision is much 
harder to measure, but depends both on the accuracy 
of the output of the part-of-speech tagger and 
on the accuracy of class-sensitive heuristics. 

3.3 Probabilistic Classification 

The knowledge about the linguistic context of nouns 
in the corpus that is collected by the pattern matcher 
is now used to classify unknown nouns. This involves 
a similarity measure between the linguistic contexts 
of classes of nouns that are in CoreLex and the 
linguistic context of unknown nouns. For this pur- 
pose the pattern matcher keeps two separate arrays, 
one that collects knowledge only on CoreLex nouns 
and the other collecting knowledge on all nouns. 

The classifier uses mutual information (MI) scores 
rather than the raw frequencies of the occurring patterns 
(Church and Hanks, 1990). Computing MI 
scores is by now a standard procedure for measuring 
the co-occurrence between objects relative to their 
overall occurrence. MI is defined in general as follows: 



$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$



We can use this definition to derive an estimate of 
the connectedness between words, in terms of collocations 
(Smadja, 1993), but also in terms of phrases 
and grammatical relations (Hindle, 1990). For instance 
the co-occurrence of verbs and the heads of 
their NP objects (N: size of the corpus, i.e. the number 
of stems): 



$$C_{obj}(v, n) = \log_2 \frac{\dfrac{f(v, n)}{N}}{\dfrac{f(v)}{N} \cdot \dfrac{f(n)}{N}}$$


All nouns are now classified by running a similarity 
measure over their MI scores and the MI 
scores of each CoreLex class. For this we use the 
Jaccard measure that compares objects relative to 
the attributes they share (Grefenstette, 1994). In 
our case the 'attributes' are the different linguistic 
constructions a noun occurs in: headnoun-verb, 
adjective-headnoun, modifiernoun-headnoun, 
etc. 

The Jaccard measure is defined as the number of 
attributes shared by two objects divided by the total 
number of unique attributes of both objects: 

$$\frac{A}{A + B + C}$$

A: attributes shared by both objects 
B: attributes unique to object 1 
C: attributes unique to object 2 



The Jaccard scores for each CoreLex class are 
sorted and the class with the highest score is as- 
signed to the noun. If the highest score is equal to 
0, no class is assigned. 
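
A minimal sketch of this classification step, assuming that each noun and each class is represented simply by the set of constructions it occurs in. This is a simplification: the classifier described here works on MI scores, which the sketch ignores. The attribute strings and function names are hypothetical.

```python
def jaccard(attrs_1, attrs_2):
    """Shared attributes divided by all unique attributes of both objects."""
    attrs_1, attrs_2 = set(attrs_1), set(attrs_2)
    union = attrs_1 | attrs_2
    return len(attrs_1 & attrs_2) / len(union) if union else 0.0


def classify(noun_attrs, class_attrs):
    """Return the class with the highest Jaccard score, or None if all are 0."""
    best_class, best_score = None, 0.0
    for label, attrs in class_attrs.items():
        score = jaccard(noun_attrs, attrs)
        if score > best_score:
            best_class, best_score = label, score
    return best_class


# Hypothetical context attributes collected by the pattern matcher.
CLASS_ATTRS = {
    "anm fod": {"eat-N", "roast-N", "N-swim"},
    "art com": {"read-N", "publish-N", "long-N"},
}
assigned = classify({"read-N", "long-N", "buy-N"}, CLASS_ATTRS)
```

Here the unknown noun shares two of four unique attributes with 'art com' (score 0.5) and none with 'anm fod', so 'art com' is assigned.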

The classification process is evaluated in terms of 
precision and recall figures, but not directly on the 
classified unknown nouns, because their precision is 
hard to measure. Rather we compute precision and 
recall on the classification of those nouns that are in 
CoreLex, because we can check their class automati- 
cally. The assumption then is that the precision and 
recall figures for the classification of nouns that are 
known correspond to those that are unknown. An 
additional measure of the effectiveness of the clas- 
sifier is measuring the recall on classification of all 
nouns, known and unknown. This number seems to 
correlate with the size of the corpus, in larger cor- 
pora more nouns are being classified, but not nec- 
essarily more correctly. Correct classification rather 
seems to depend on the homogeneity of the corpus: 
if it is written in one style, with one theme and so 
on. 

Recall of the classifier (percentage of all nouns that 
are classified > 0) is on average, among different 
larger corpora (> 100,000 tokens), about 80% to 
90%. Recall on the nouns in CoreLex is between 
35% and 55%, while precision is between 20% and 
40%. The last number is much better on smaller cor- 
pora (70% on average). More detailed information 
about the performance of the classifier, matcher and 
acquisition tool (see below) can be obtained from 
(Buitelaar, forthcoming). 

3.4 Lexical Knowledge Acquisition 

The final step in the process of adapting CoreLex 
to a specific domain involves the 'translation' of ob- 
served syntactic patterns into corresponding seman- 
tic ones and generating a semantic lexicon represent- 
ing that information. 

There are basically three kinds of semantic patterns 
that are utilized in a CoreLex lexicon: hyponymy 



(sub-supertype information) in the FORMAL role, 
meronymy (part-whole information) in the CONSTITUTIVE 
role and predicate-argument structure in the 
TELIC and AGENTIVE roles. There are no compelling 
reasons to exclude other kinds of information, but 
for now we base our basic design on GL, which only 
includes these three in its definition of qualia structure. 

Hyponymic information is acquired through the classification 
process discussed in Sections 2.2 and 3.3. 



Meronymic information is obtained through a translation 
of various VP and PP patterns into 'has-part' 
and 'part-of' relations. Predicate-argument structure, 
finally, is derived from verb-headnoun and 
headnoun-verb constructions. 

The semantic lexicon that is generated in such a 
way comes in two formats: TDL, a Type Description 
Language based on typed feature-logic 
(Krieger and Schaefer, 1994a) (Krieger and Schaefer, 
1994b) and HTML, the markup language for the 
World Wide Web. The first provides a constraint- 
based formalism that allows CoreLex lexicons to 
be used straightforwardly in constraint-based gram- 
mars. The second format is used to present a gen- 
erated semantic lexicon as a semantic index on a 
World Wide Web document. We will not elaborate 
on this further because the subject of semantic in- 
dexing is out of the scope of this paper, but we refer 
to (Pustejovsky et al., 1997). 

3.5 An Example: The pdgf Lexicon 

The semantic lexicon we generated for the PDGF 
corpus covers 1830 noun stems, spread over 81 
CoreLex types. For instance, the noun evidence 
is of type communication•psychological and the 
following representation is generated: 

evidence 
    FORMAL = 
        CLOSED = 
            ARG1 = communication 
            ARG2 = psychological 
    CONSTITUTIVE = 
        HAS-PART = 
            FIRST = structure 
            REST = ... 
    TELIC = 
        FIRST = 
            provide 
            ARG-STRUCT = 
        REST = ... 

Figure 10: Lexical entry for: evidence 



4 Conclusion 

In this paper I discuss the construction of a new 
type of semantic lexicon that supports underspeci- 
fied semantic tagging. Traditional semantic tagging 
assumes a number of distinct senses for each lexical 
item between which the system should choose. 
Underspecified semantic tagging however assumes no 
finite lists of senses, but instead tags each lexical 
item with a comprehensive knowledge representa- 
tion from which a specific interpretation can be con- 
structed. CoreLex provides such knowledge rep- 
resentations, and as such it is fundamentally differ- 
ent from existing semantic lexicons like WordNet. 
Additionally, it was shown that CoreLex provides 
for more consistent assignments of lexical semantic 
structure among classes of lexical items. Finally, 
the approach described above allows one to generate 
domain specific semantic lexicons by enhancing 
CoreLex lexical entries with corpus based information. 